ML - Final Assignment

Yael Ohayon - 312542558

First Mission - Clustering

Use one or more of the clustering algorithms we discussed in class to cluster together artists based on similarities. Usually, we use unsupervised learning in the earlier stages of the project. Discuss the results and support your claim in at least one plot (in addition to the clustering plot). This graph may relate to your predictions or incorporate any information from an outside source (please mention explicitly any source you used as your help).

In the following paragraphs I will explain what I did in order to cluster the data.

Stage I: Pre-Processing:

As explained in the requirements, I did some pre-processing, which is necessary for analyzing the data later. To begin, I iterated over all images given in the train and validation folders and, using the Python package for image processing, OpenCV, resized all images to 100 x 100, so from now on all the analysis is made over pictures of the same dimensions. The function I used can be found at the end of this notebook.

Later, in order to answer the question above, I chose to cluster the data according to the given label of each artist's origin.

How did I do it?

First, because the requirements asked to plot graphs, I knew I had to use 2 or 3 dimensions, because more than that is not possible to plot. So I needed to reduce the image dimension, which, although minimized to 100 x 100 x 3 (because each pixel holds 3 color channels), is still a lot!

I read about image processing and dimensionality reduction and decided to use PCA (Principal Component Analysis) in order to reduce the image dimension.

Why PCA?

PCA creates a new axis, which is a linear combination of the original features, that explains most of the data variance. Mathematically, we search for the combination such that projecting onto it, as a new axis, gives us the highest variance of the original data compared to all other projections.
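A sketch of this idea with scikit-learn, using random data as a stand-in for the flattened image matrix (the shapes and variable names here are illustrative, not the ones from my pipeline):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 300))  # 60 "images", 300 features each

# Project onto the top 3 principal components, e.g. for a 3D scatter plot
pca = PCA(n_components=3)
X_3d = pca.fit_transform(X)
print(X_3d.shape)  # (60, 3)

# Each component is a unit-norm linear combination of the original features
assert np.allclose(np.linalg.norm(pca.components_, axis=1), 1.0)
```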

Pay attention: due to the high cost (in time and memory) of processing the data, I created all tables as NumPy or pandas objects (pickles) that are already in this folder and are loaded here for analysis (but not created here). In order to check the code you can run the commented lines; the data is generated by that code.

Sources I used: https://www.pyimagesearch.com/2016/08/08/k-nn-classifier-for-image-classification

PCA Investigation

We know we can't really use that many features, but even though we reduce the dimension, we first want to evaluate: how does this reduction affect the data? What does it look like? In the following graph (1 - Projection Onto PCA Subspace) we can see each image in 3 dimensions, described as a dot given by its PC1, PC2 and PC3 linear combinations. Moreover, each dot's color represents the artist's origin. Although it's pretty, we can't understand much by now.

We should think about and evaluate our new sub-space: how good are these PCs? That is, how well does the projection onto them describe the data (how much of the variance does it keep)?

As we learned and discussed in class, the ratio between each eigenvalue and the sum of all eigenvalues of the features matrix gives us the proportion of the variance it explains. We can see the absolute values of the eigenvalues and these proportions in the second graph - (2) PCA Explained Variance.
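That eigenvalue ratio is exactly what scikit-learn exposes as `explained_variance_ratio_`; a quick sanity check on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 20))

pca = PCA().fit(X)  # keep all components

# Each ratio is eigenvalue_i / sum(eigenvalues); over all components they sum to 1
ratios = pca.explained_variance_ / pca.explained_variance_.sum()
assert np.allclose(ratios, pca.explained_variance_ratio_)
assert np.isclose(pca.explained_variance_ratio_.sum(), 1.0)
```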

In our data the top 3 PCs explain in total 45% of the data variance. That's cool! Instead of 100 x 100 x 3 features that describe 100% of a picture, we need only 3 to describe 45% of the variance.

In the next graph we evaluate the PCA reduction for the next mission - checking how many features to use for data classification, when plotting is not needed.

So next we need to decide: how many components should we use? Well, we know that adding components makes the analysis harder, so we want to reduce the number of components but, on the other hand, keep the explained variance as high as we can.

The following code checks component numbers from 1 to 200 in jumps of 5. As we can see, 100 components give us 80% explained variance and 200 components give us only 85% explained variance, which means that adding 100 more components adds only 5% of variance explanation.
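A sketch of that sweep (smaller synthetic data here so it runs quickly; in the notebook the range was `range(1, 200, 5)` over the real image matrix). One shortcut worth noting: instead of refitting PCA for every candidate k, you can fit once and read the cumulative explained variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 250))  # stand-in for the flattened-image matrix

# Fit once; the cumulative sum gives the variance explained by the top-k components
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for k in range(1, 120, 25):  # in the notebook: range(1, 200, 5)
    print(k, round(float(cumulative[k - 1]), 3))

# Cumulative explained variance can only grow as we add components
assert np.all(np.diff(cumulative) >= -1e-12)
```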

So, in the next steps we will analyze the data using only 100 components.

So, what's next? Next we will use the top two components in order to cluster the data using KNN!

Because KNN is a non-parametric model, we will use the whole X and the whole y (and not train, test and validation splits as we will use in the next mission).

We will do that with different neighbor counts, in order to get the best score and to get a sense of how the number of neighbors changes the clustering map.
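The loop over neighbor counts can be sketched like this, with synthetic labels standing in for the three artist origins (the neighbor values and shapes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(90, 50))
y = rng.integers(0, 3, size=90)  # stand-in for 3 origins (French / US / Dutch)

# Keep only the top two principal components, as in the clustering plots
X_2d = PCA(n_components=2).fit_transform(X)

# Non-parametric model: fit and score on the whole X, y (no train/test split here)
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_2d, y)
    print(k, knn.score(X_2d, y))
```

With k=1 every point is its own nearest neighbor, so the training score is perfect; larger k smooths the decision map, which is the bias-variance effect discussed below.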

So, what can we learn from the above clustering?

First, it's nice to see the bias-variance trade-off: we can see that increasing the number of neighbors makes the Dutch classification less significant (bias rises, variance decreases), while using a low number of neighbors makes it more significant.

This shows the bias-variance trade-off for KNN: using more neighbors increases bias and decreases variance, and fewer neighbors do the opposite.

We also see that French artists rule the centre of PC1 and PC2 while US artists rule the edges, mostly the upper-right and upper-left corners.

We can see that the Dutch class isn't clearly separated from the French and US ones. We can point out that we don't have many Dutch artists (actually we have only one), which might not be enough for this kind of mission.

Moreover, I tried to figure out whether we can classify artists by their dominant color. I summed up the first 1/3 of the weight values in the first and second component vectors, in order to get the "weight" of the red color, and summed the next 1/3 of the values to get the weight of the green channel, and the last 1/3 for the blue channel.
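A sketch of that channel-weight computation. Two assumptions to flag: it takes absolute values before summing (one reasonable way to read "weight"; the notebook may have summed raw values), and it assumes the flattened vector is ordered channel-by-channel (all red values, then green, then blue), which is what makes each third correspond to one color:

```python
import numpy as np

def channel_weights(component):
    """Sum the absolute weights in each third of a PCA component vector."""
    n = len(component)
    assert n % 3 == 0
    third = n // 3
    return {
        "red": float(np.abs(component[:third]).sum()),
        "green": float(np.abs(component[third:2 * third]).sum()),
        "blue": float(np.abs(component[2 * third:]).sum()),
    }

# Toy component instead of the real PC1/PC2 vectors of length 100*100*3
comp = np.array([0.5, -0.5, 0.1, 0.1, 0.2, 0.2])
print(channel_weights(comp))  # {'red': 1.0, 'green': 0.2, 'blue': 0.4}
```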

We can see that there is no single dominant color (at least for the two major components we use); unfortunately, it didn't help much!

Second Mission - Classification

In this part we will use the features matrix we got from projecting onto the top 400 components of the PCA.

Build two classification models in order to predict the painter from the paint. The goal here is to make a good prediction. Please include explanations on the process of developing your models. Be as clear and descriptive as you can be.

So, I thought it could be nice to simply try multiple classification models and check which is the best. For now I will just try to get some sense of how it looks, and next I will choose two of the models and, using the validation data set, try to improve them. So I tried several models that I found in https://scikit-learn.org/stable/supervised_learning.html; my attempts can be viewed in the last section of this notebook.

I - Random Forest Model

Model Description:

Random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

I chose this model because tree classifiers tend to "break down" the data in a greedy way that maximizes the score over the train set; I thought it could be useful and a good fit for image processing.

Moreover, I thought that because random forest uses a kind of "bagging" that de-correlates the data while growing the trees (it chooses the k coordinates randomly), it could help with this specific data, which seems to be very correlated.

Present the tuning process. Alongside your description, add a table with hyper-parameters and their corresponding accuracies on the training and CV datasets, ordered by the CV accuracy in a decreasing order. Show only the best 15 combinations; that is, the table should consist of 15 rows max.

In the table above we can see all the tuning parameters I mentioned, along with the train, test and validation scores. We should pay attention that we "cropped" the table and kept only the 15 lines with the best validation score.

We can see that several combinations give the best prediction over the validation data set. For those combinations we can point to a kind of trade-off between tree depth, the number of estimators and min samples split, in the sense that we can get the best result both by increasing tree depth (from 9 to 17/19) and by increasing the number of estimators and min samples split. It makes sense because these parameters have opposite effects on the bias-variance trade-off, so in total they keep the result quite the same.

So, in order to evaluate the performance of the Random Forest Classifier, I focused first on tuning several hyper-parameters. I created a table of all combinations of the hyper-parameters, and for every combination checked the train, validation and test scores. The parameters I chose to use in the tuning process are those I found can be tuned in the sklearn documentation, which can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

  1. max depth - The maximum depth of the tree. We know this affects the bias-variance trade-off in trees: high depth -> low bias and high variance.

  2. number of estimators - The number of trees in the forest.

  3. min samples split - The minimum number of samples required to split an internal node.

  4. min samples leaf - The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

  5. max samples - If bootstrap is True, the number of samples to draw from X to train each base estimator.

I chose these parameters because they are the main parameters of the model that can take different values, so we can try to tune them. For each of the parameters, the range I used also considered the default value of the sklearn package.
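The all-combinations table can be sketched with `sklearn.model_selection.ParameterGrid` (a tiny grid over a subset of the listed parameters, on synthetic data, so it runs quickly; the real ranges were wider and included all five parameters):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 20))
y = rng.integers(0, 3, size=120)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

grid = ParameterGrid({
    "max_depth": [3, 9],
    "n_estimators": [10, 50],
    "min_samples_split": [2, 6],
})

# One row per combination: (params, train score, validation score)
rows = []
for params in grid:
    clf = RandomForestClassifier(random_state=0, **params).fit(X_tr, y_tr)
    rows.append((params, clf.score(X_tr, y_tr), clf.score(X_val, y_val)))

# Sort by validation accuracy, decreasing, as in the table above
rows.sort(key=lambda r: r[2], reverse=True)
best_params, best_train, best_val = rows[0]
print(best_params, round(best_val, 3))
```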

Evaluate your performance using the tools from the class.

We will evaluate our classification using the ROC curve. Pay attention that it's a bit more complex with multiple classes.

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

We can learn from the results over the test set that, although the raw classification results are poor for both models, the TPR and FPR based evaluation gives us a more detailed picture.

But pay attention to the big difference between the ROC curve and the next evaluation I show. ROC uses the probability of getting a specific label; it's not "binary", and it also counts how "close" our algorithm is to the true value. For example, it might be that we predict an image wrong because the wrong artist got a higher probability, but only a bit higher than the correct artist's probability, so we want to take this "closeness" to the correct answer into account!

So, in terms of the ROC curve, we aren't that bad! (Although from the binary point of view we are...) Because there are many lines (a ROC curve for each artist), we will use the AUC parameter, because it measures the ability of a classifier to distinguish between classes (here, different artists).
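The per-artist one-vs-rest AUC and its macro average can be sketched like this (random data stands in for the artist labels and PCA features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 10))
y = rng.integers(0, 4, size=150)  # 4 "artists"

clf = RandomForestClassifier(random_state=0).fit(X, y)
proba = clf.predict_proba(X)  # one probability column per class

# One ROC/AUC per class: binarize y and score each class against the rest
y_bin = label_binarize(y, classes=[0, 1, 2, 3])
per_class_auc = [roc_auc_score(y_bin[:, i], proba[:, i]) for i in range(4)]

# Or let sklearn average over the classes directly
macro_auc = roc_auc_score(y, proba, multi_class="ovr", average="macro")
assert np.isclose(macro_auc, np.mean(per_class_auc))
```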

Explore your predictions. Which paintings were misclassified? Why?

Load X_test and y_test and test your model. -- I talked with David and instead split the train folder given by Kaggle into test and train.

In order to answer that question I will use the best model given by the table above:

We can see that the artists for whom the model had the best classification over their artwork are Matisse, Monet and Renoir. The artists we had low success classifying are Degas, Hassam and Cezanne.

It's interesting to understand why those specific artists are misclassified. In order to do that, we could trace the tree's greedy steps to see what happened and in which final "box" the artists appear. This would be very hard, so another option that gets the same effect is to check the probability of getting one of the highly misclassified artists, and to get a sense of why the algorithm is "confused" about it. This is what I will try to do next (I gave only one artist as an example, because the idea is similar for everyone else).

We can see, for example, that although Degas gets the highest rate of classification for his artwork (14%), it's not really clear-cut, and it's very close to the chance of getting his artwork classified as Renoir or Gauguin, who get 11% and 10% respectively.
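The kind of check behind that observation can be sketched like this (synthetic data; in the notebook the probabilities come from the tuned forest over Degas's paintings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 5, size=100)  # stand-in for 5 artists

clf = RandomForestClassifier(random_state=0).fit(X, y)

# Average predicted probability per class over one "artist's" paintings
mask = y == 0
mean_proba = clf.predict_proba(X[mask]).mean(axis=0)
for label, p in zip(clf.classes_, mean_proba):
    print(label, round(float(p), 3))
```

Because the per-class probabilities sum to 1, a row like this makes "close calls" (the 14% vs 11% vs 10% situation above) easy to spot.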

Model II - SVC

Model Description: Generally, SVM is an algorithm that creates a line or a hyperplane which separates the data into classes. The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, you should get misclassified examples, often even if your training data is linearly separable.
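The effect of C can be sketched on toy separable data (a linear kernel here for clarity; the tuned model in the notebook used its own kernel and ranges):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
# Two well-separated 2D blobs
X = np.vstack([rng.normal(-2, 0.5, (40, 2)), rng.normal(2, 0.5, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

for C in (0.001, 1.0, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C -> wider margin -> more points end up inside it as support vectors
    print(C, clf.score(X, y), len(clf.support_))
```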

Present the tuning process. Alongside your description, add a table with hyper-parameters and their corresponding accuracies on the training and CV datasets, ordered by the CV accuracy in a decreasing order. Show only the best 15 combinations; that is, the table should consist of 15 rows max.

Just as before, we tried to improve classification by tuning several parameters. These are the non-boolean parameters of the model.

Evaluate your performance using the tools from the class.

Again we will evaluate performance with the ROC curve; we can see the values are quite similar to those of the random forest.

Explore your predictions. Which paintings were misclassified? Why?

Same as before, now only for the predictions made by the SVM model.

Top artists score classification:

We already saw that the reason for that misclassification is the similarity between different artists' works.

Discuss the differences between your models, in their assumptions, and explain why did you choose them. Consider manipulating your data to see if it helps you achieve better results.

I used also this source - https://datascience.stackexchange.com/questions/6838/when-to-use-random-forest-over-svm-and-vice-versa

We already talked about each one of the models separately, described it and showed the effect of parameter tuning over the data. For example, we talked about the bias-variance trade-off and the parameters that affect it in each of the models (mainly tree depth in random forest and the C regularization parameter in SVM). We also saw that the models are the same in the way they "view" the problem, since both of them belong to the field of supervised learning.

I chose these models because, first, while trying many other models from sklearn, they did the best; second, these models are "rich" in parameters that can be tuned, so in this kind of problem I thought it would be good. Moreover, we might not be that surprised these models worked because, in some manner, they are quite similar: both of them "split" the subspace created by the features into "boxes" (tree) or sub-subspaces (SVM).

I tried a couple of ideas to manipulate the data, such as not scaling the data, using a grey-scale histogram, or using separate red, green and blue histograms, but none of these ideas really worked (an example of a histogram I used is given at the bottom of the notebook). I guess that people who are experts in image processing could have better ideas based on more complex theory.

SVM vs Random Forest:

| Criteria | SVM | Random Forest |
| --- | --- | --- |
| Fits multiclass problems? | No, we get probabilities from distance calculations | Yes |
| Data scaling? | Needed | Not needed |
| Complexity? | High, due to the n x n matrix | Low |
| When to use? | When the problem might not be linearly separable | Handles a large number of training examples and non-linear data |
| What's common? | Supervised | Supervised |

How does the model perform? Discuss.

For both models we see low results for the validation and test sets. But if we investigate the results a bit more we can see, using the ROC curve, that in terms of probability the classification isn't that bad: although it's wrong in the "bottom line", it's quite close to the correct results.

Unfortunately, we reached only a ~31% success rate for the validation and test sets, in both random forest and SVM.

Since I tried many other models, I can say that even though this result isn't great, it's better than the others.

I think the main reason the models aren't very good is that some of the artworks are quite the same, or very similar. Also, I think that using the data "as is" with no manipulation (besides resizing, which only re-scales it but doesn't change it) may cause this low success rate. Moreover, I believe that high-level Deep Learning applications, built specifically for image processing and classification, could help us improve the results.

Functions I used